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Abstract 



In this paper, we analyze the celebrated EM algorithm from the point of view of 
proximal point algorithms. More precisely, we study a new type of generalization of 
the EM procedure introduced in [3] and called Kullback-proximal algorithms. The 
proximal framework allows us to prove new results concerning the cluster points. An 
essential contribution is a detailed analysis of the case where some cluster points lie 
on the boundary of the parameter space. 

1 Introduction 

The problem of maximum likelihood (ML) estimation consists of finding a solution of the 
form 



where y is an observed sample of a random variable Y denned on a sample space 3^ and 
l y {6) is the log-likelihood function defined by 



defined on the parameter space 9 C M n , and g(y,9) denotes the density of Y at y 
parametrized by the vector parameter 6. 

The Expectation Maximization (EM) algorithm is an iterative procedure which is widely 
used for solving ML estimation problems. The EM algorithm was first proposed by Demp- 
ster, Laird and Rubin [8] and has seen the number of its potential applications increase 
substantially since its appearance. The book of McLachlan and Krishnan [T3] gives a 
comprehensive overview of the theoretical properties of the method and its applicability. 

The convergence of the sequence of EM iterates towards a maximizer of the likelihood 
function was claimed in the original paper [H] but it was later noticed that the proof 
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contained a flaw. A careful convergence analysis was finally given by Wu [21] based on 
Zangwill's general theory [23]; see also [H]. Zangwill's theory applies to general iterative 
schemes and the main task when using it is to verify that the assumptions of Zangwill's 
theorems are satisfied. Since the appearance of Wu's paper, convergence of the EM al- 
gorithm is often taken for granted in many cases where the necessary assumptions were 
sometimes not carefully justified. As an example, an often neglected issue is the behav- 
ior of EM iterates when they approach the boundary of the domain of definition of the 
functions involved. A different example is the following. It is natural to try and establish 
that EM iterates actually converge to a single point 9*, which involves proving uniqueness 
of the cluster point. Wu's approach, reported in [HJ Theorem 3.4, p. 89] is based on 
the assumption that the euclidean distance between two successive iterates tends to zero. 
However such an assumption is in fact very hard to verify in most cases and should not be 
deduced solely from experimental observations. 

The goal of the present paper is to propose an analysis of EM iterates and their gen- 
eralizations in the framework of Kullback proximal point algorithms. We focus on the 
geometric conditions that are provable in practice and the concrete difficulties concerning 
convergence towards boundaries and cluster point uniqueness. The approach adopted here 
was first proposed in [I] in which it was shown that the EM algorithm could be recast as 
a Proximal Point algorithm. A proximal scheme for maximizing the function l y {9) using 
the distance-like function I y is an iterative procedure of the form 



where ((3k) ken is a sequence of positive real numbers often called relaxation parameters. 
Proximal point methods were introduced by Martinet [13] and Rockafellar [T7] in the 
context of convex minimization. The proximal point representation of the EM algorithm 
[I] is obtained by setting = 1 and I y {9, 9 k ) to the Kullback distance between some well 
specified conditional densities of a complete data vector. The general case of (3 k > was 
called the Kullback Proximal Point algorithm (KPP). This approach was further developed 
in [5] where convergence was studied in the twice different iable case with the assumption 
that the limit point lies in the interior of the domain. The main novelty of [5] was to 
prove that relaxation of the Kullback-type penalty could ensure superlinear convergence 
which was confirmed by experiment for a Poisson linear inverse problem. This paper is an 
extension of these previous works that addresses the problem of convergence under general 
conditions. 

The main results of this paper are the following. Firstly, we prove that all the clus- 
ter points of the Kullback proximal sequence which lie in the interior of the domain are 
stationary points of the likelihood function l y under very mild assumptions that are easily 
verified in practice. Secondly, taking into account finer properties of I y , we prove that every 
cluster point on the boundary of the domain satisfies the Karush-Kuhn- Tucker necessary 
conditions for optimality under nonnegativity constraints. To illustrate our results, we 
apply the Kullback-proximal algorithm to an estimation problem in animal carcinogenicity 
introduced in [1] in which an interesting nonconvex constraint is handled. In this case, the 
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M-step cannot be obtained in closed form. However, the Kullback-proximal algorithm can 
be analyzed and implemented. Numerical experiments are provided which demonstrate 
the ability of the method to significantly accelerate the convergence of standard EM. 

The paper is organized as follows. In Section |2} we review the Kullback proximal point 
interpretation of EM. Then, in Section |3] we study the properties of interior cluster points. 
We prove that such cluster points are in fact global maximizers of a certain penalized 
likelihood function. This allows us to justify using a relaxation parameter /3 when /3 is 
sufficiently small to permit avoiding saddle points. Section H] pursues the analysis in the 
case where the cluster point lies on a boundary of the domain of I y . 

2 The Kullback proximal framework 

In this section, we review the EM algorithm and the Kullback proximal interpretation 
discussed in [5]. 

2.1 The EM algorithm 

The EM procedure is an iterative method which produces a sequence (9 k )ken such that 
each 9 k+l maximizes a local approximation of the likelihood function in the neighborhood 
of 9 k . This point of view will become clear in the proximal point framework of the next 
subsection. 

In the traditional approach, one assumes that some data are hidden from the observer. 
A frequent example of hidden data is the class to which each sample belongs in the case 
of mixtures estimation. Another example is when the observed data are projection of an 
unkown object as for image reconstruction problems in tomography. One would prefer 
to consider the likelihood of the complete data instead of the ordinary likelihood. Since 
some parts of the data are hidden, the so called complete likelihood cannot be computed 
and therefore must be approximated. For this purpose, we will need some appropriate 
notations and assumptions which we now describe. The observed data are assumed to be 
i.i.d. samples from a unique random vector Y taking values on a data space y. Imagine 
that we have at our disposal more informative data than just samples from Y. Suppose 
that the more informative data are samples from a random variable X taking values on a 
space X with density f(x;9) also parametrized by 9. We will say that the data X is more 
informative than the actual data Y in the sense that Y is a compression of X, i.e. there 
exists a non- invert ible transformation h such that Y = h(X). If one had access to the data 
X it would therefore be advantageous to replace the ML estimation problem ([!]) by 



with l x {9) = logf(x; 9). Since y = h(x) the density g of Y is related to the density f of X 
through 



9 M l = argmax eeRP / a; (6'), 
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for an appropriate measure n on X. In this setting, the data y are called incomplete data 
whereas the data x are called complete data. 

Of course the complete data x corresponding to a given observed sample y are unknown. 
Therefore, the complete data likelihood function l x (8) can only be estimated. Given the 
observed data y and a previous estimate of 9 denoted 8, the following minimum mean 
square error estimator (MMSE) of the quantity l x {9) is natural 

Q(9,9) = E[\og f(x;9)\y;9], 

where, for any integrable function F(x) on X, we have defined the conditional expectation 

E[F(x)\y;6]= I F(x)k{x\y;9)d^{x) 
Jh-H{y}) 

and k(x\y; 9) is the conditional density function given y 

k(x\y;9) = - (6) 

Having described the notions of complete data and complete likelihood and its local 
estimation we now turn to the EM algorithm. The idea is relatively simple: a legitimate 
way to proceed is to require that iterate 9 k+1 be a maximizer of the local estimator of 
the complete likelihood conditionally on y and 9 k . Hence, the EM algorithm generates a 
sequence of approximations to the solution (jlj) starting from an initial guess 9° of 9ml and 
is defined by 

Compute Q{9,9 k ) = E[log f(x; 9)\y- 9 k ] E Step 
9 k+l = argmax eeRP Q(#, 9 k ) M Step 

2.2 Kullback proximal interpretation of the EM algorithm 

Consider the general problem of maximizing a concave function The original proximal 

point algorithm introduced by Martinet [13] is an iterative procedure which can be written 

9 k+1 = argmax^ {$(0) - ^\\9 - 9 k \\ 2 } . (7) 

The quadratic penalty |||0 — 9 k \\ 2 is relaxed using a sequence of positive parameters 

In [T7j, Rockafellar showed that superlinear convergence of this method is obtained when 

the sequence {(3k} converges towards zero. 

It was proved in [5] that the EM algorithm is a particular example in the class of 
proximal point algorithms using Kullback Leibler types of penalties. One proceeds as 
follows. Assume that the family of conditional densities {k(x\y; 9)} 8eR p is regular in the 
sense of Ibragimov and Khasminskii [9], in particular k(x\y; 9)fi(x) and k(x\y; 9)fi(x) are 
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mutually absolutely continuous for any 9 and 9 in M p . Then the Radon-Nikodym derivative 

k{x\y,9) 
k(x\yfi) 



k ( x \ y ' 9 ' > exists for all 9, 9 and we can define the following Kullback Leibler divergence: 



W) = E[logM^M]. (8) 

L k(x\y;9) J 

We are now able to define the Kullback-proximal algorithm. For this purpose, let us define 
D\ as the domain of l y , the domain of I y (-, 9) and Di the domain of I y (-, ■). 

Definition 2.2.1. Let ((3k)km be a sequence of positive real numbers. Then, the Kullback- 
proximal algorithm is defined by 

9 k+1 = argmax eeD;nD/ flfc Z^) - (3 k I y (9, 9 k ). (9) 

The main result on which the present paper relies is that EM algorithm is a special 
case of ()9]), i.e. it is a penalized ML estimator with proximal penalty I y (9,9 k ). 

Proposition 2.2.2. [5, Proposition 1] The EM algorithm is a special instance of the 
Kullback-proximal algorithm with flk — 1, for all k G N. 

The previous definition of the Kullback proximal algorithm may appear overly general 
to the reader familiar with the usual practical interpretation of the EM algorithm. However, 
we found that such a framework has at least the three following benefits [5]: 

• to our opinion, the convergence proof of our EM is more natural, 

• the Kullback proximal framework may easily incorporate additional constraints, a 
feature that may be of crucial importance as demonstrated in the example of Section 
15.11 below, 

• the relaxation sequence ((3k) kern allows one to weight the penalization term and its 
convergence to zero implies quadratic convergence in certain examples. 

The first of these three arguments is also supported by our simplified treatment of the 
componentwise EM procedure proposed in [3] and the remarkable recent results of [20] 
based on a special proximal entropic representation of EM for getting precise estimates on 
the convergence speed of EM algorithms, however, with much more restrictive assumptions 
than the ones of the present paper. 

Although our results are obtained under mild assumptions concerning the relaxation 
sequence (f3k)kem including the case (3k = 0, several precautions should be taken when 
implementing the method. However, one of the key features of EM-like procedures is to 
allow easy handling of positivity or more complex constraints, such as the ones discussed in 
the example of Section 15.11 In such cases the function I y behaves like a barrier whose value 
increases to infinity as the iterates approach the boundary of the constraint set. Hence, 
the sequence ((3k)k<=n ought to be positive in order to exploit this important computational 
feature. On the other hand, as proved under twice differentiability assumptions in [5] when 
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the cluster set reduces to a unique nondegenerate maximizer in the interior of the domain 
of the log-likelihood and fik converges to zero, quadratic convergence is obtained. This nice 
behavior is not satisfied in the plain EM case where /3fe = 1 for all k G N. As a drawback, 
one problem in decreasing the /3fc's too quickly is possible numerical ill conditioning. The 
problem of choosing the relaxation sequence is still largely open. We have found however 
that for most "reasonable" sequences, our method was at least as fast as the standard EM. 

Finally, we would like to end our presentation of KPP-EM by noting that closed form 
iterations may not be available in the case flk ^ 1. If this is the case, solving (Q becomes a 
subproblem which will require iterative algorithms. In some interesting examples, e.g. the 
case presented in Section 15.11 In this case, the standard EM iterations are not available 
in closed form in the first place and KPP-EM provides faster convergence while preserving 
monotonicity and constraint satisfaction. 

2.3 Notations and assumptions 

The notation || • || will be used to denote the norm on any previously defined space without 
more precision. The space on which it is the norm should be obvious from the context. For 
any bivariate function $, Vi$ will denote the gradient with respect to the first variable. 
In the remainder of this paper we will make the following assumptions. 

Assumptions 2.3.1. (i) l y is differentiable on D\ and l y (9) tends to — oo whenever \\9\\ 
tends to +oo. 

(ii) the projection of Dj onto the first coordinate is a subset of D\. 

(iii) (f3k)keN is a convergent nonnegative sequence of real numbers whose limit is denoted 
by 0*. 

We will also impose the following assumptions on the distance-like function I y . 

Assumptions 2.3.2. (i) There exists a finite dimensional euclidean space S, a differen- 
tiable mapping t : Di h->- S and a functional \I> : C 5 x 5 H- 1 such that 

I y (9,9) = ^(t(9),t(9)), 

where denotes the domain of 

(ii) For any {t k ,t)ken} C there exists p t > such that ]ha\\ t k^ t n^. 0O Iy(t k ,t) > pt- 
Moreover, we assume that inf t£ M Pt > for any bounded set M C S. 

For all (t',t) in D^, we will also require that 

(iii) (Positivity) > 0, 

(iv) (Identifiability) tf(t',t) = 0^t = t', 

(v) (Continuity) $ is continuous at (£',£) 

and for all t belonging to the projection of onto its second coordinate, 

(vi) (Differentiability) the function *&(-,t) is differentiable at t. 

Assumptions I2.3.1( i) and (ii) on l y are standard and are easily checked in practical 
examples, e.g. they are satisfied for the Poisson and additive mixture models. Notice that 
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the domain Dj is now implicitly denned by the knowledge of Di and D^. Moreover I y is 
continuous on Dj. The importance of requiring that I y has the prescribed shape comes from 
the fact that I y might not satisfy assumption I2.3.2( iv) in general. Therefore assumption 
12.3.21 (iv) reflects the requirement that I y should at least satisfy the identifiability property 
up to a possibly injective transformation. In both examples discussed above, this property 
is an easy consequence of the well known fact that alog(a/6) = implies a = b for positive 
real numbers a and b. The growth, continuity and differentiability properties 12.3.21 (ii). (v) 
and (vi) are, in any case, nonrestrictive. 

For the sake of notational convenience, the regularized objective function with relax- 
ation parameter (3 will be denoted 

F (9,9) = l y (9)-f3I y (9,9). (10) 

Finally we make the following general assumption. 

Assumptions 2.3.3. The Kullback proximal iteration is well defined, i.e. there exists 
at least one maximizer of Fpk(9,9 k ) at each iteration k. 

In the EM case, i.e. (3 = 1, this last assumption is equivalent to the computability 
of M-steps. A sufficient condition for this assumption to hold would be, for instance, 
that F/3(9,9) be sup-compact, i.e. the level sets {9 \ Fp(9,9) > a} be compact for all a, 
(3 > and 9 e D[. However, this assumption is not usually satisfied since the distance-like 
function is not defined on the boundary of its domain. In practice it suffices to solve the 
equation VF fS k(9,9 k ) = 0, to prove that the solution is unique. Then assumption 12.3. l( i) 
is sufficient to conclude that we actually have a maximizer. 

2.4 General properties : monotonicity and boundedness 

Using Assumptions we easily deduce monotonicity of the likelihood values and bound- 
edness of the proximal sequence. The first two lemmas are proved, for instance, in [5]. 
We start with the following monotonicity result. 

Lemma 2.4.1. [5, Proposition 2} For any iteration fceN, the sequence (9 k )ken satisfies 

/ tf (0*+i) _ i y (e k ) > (3 k I y (9 k , 9 k+1 ) > 0. (11) 

From the previous lemma, we easily obtain the boundedness of the sequence. 
Lemma 2.4.2. J2> Lemma 2] The sequence (9 k )k£N is bounded. 

The next lemma will also be useful. 

Lemma 2.4.3. Assume that there exists a subsequence (9 a ^)km belonging to a compact 
set C included in D\. Then, 

lim p k I y (9 k+1 ,9 k )=0. 

k— >oo 
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Proof. Since l y is continuous over C, sup e6C l y {9) < +00 and (l y (9 a ^))keN is therefore 
bounded from above. Moreover, Lemma 12.4.11 implies that the sequence (l y (9 k ))ken is 
monotone nondecreasing. Therefore, the whole sequence (l y (9 k ))ken is bounded from above 
and convergent. This implies that lim^oo ly(9 k+1 ) ~ ly(9 h ) = 0- Applying Lemma [2.4.11 
again, we obtain the desired result. □ 



3 Analysis of interior cluster points 

The convergence analysis of Kullback proximal algorithms is split into two parts, the first 
part being the subject of this section. We prove that if the accumulation points 9* of 
the Kullback proximal sequence satisfy (9*, 9*) G Dj they are stationary points of the 
log-likelihood function l y . It is also straightforward to show that the same analysis applies 
to the case of penalized likelihood estimation. 

3.1 Nondegeneracy of the Kullback penalization 

We start with the following useful lemma. 

Lemma 3.1.1. Let (acfykeN and {pfyk&i be two bounded sequences in satisfying 

lim *(ai,«2) = 0. 

k— too 

Assume that every couple (a*, a 2 ) of accumulation points of these two sequences lies in 
Dq, . Then, 

lim 1 1 a* - a£|| = 0. 

k— ¥00 

Proof. First, one easily obtains that (a^fceN is bounded (use a contradiction argument 
and Assumption 12.3.21 (ii)). Assume that there exits a subsequence (ai^)ken such that 
|| a£ — a£ || — 3e for some e > and for all large k. Since (ati^)keN is bounded, 
one can extract a convergent subsequence. Thus we may assume without any loss of 
generality that (a^ J&gN is convergent with limit a*. Using the triangle inequality, we 
have IIcki — a*|| + ||ai— alj || > 3e. Since (al^)ken converges to a*, there exists a integer 
K such that k > K implies ||afi — a*|| < e. Thus for k > K we have \\a\ — a 2 || > 2e. 
Now recall that (a^fceN is bounded and extract a convergent subsequence (0% )k>K with 
limit denoted by a* 2 . Then, using the same arguments as above, we obtain ||ck* — a||| > e. 
Finally, recall that lim*-**, ^{a\ , a£) = 0. We thus have lim*-** ^K (7(fc)) , a 2 {l{k)) ) = 0, 
and, due to the fact that the sequences are bounded and •) is continuous in both 
variables, we have Iy^a^a^) = 0. Thus assumption 12.3.21 (iv) implies that ||a* — a 2 \\ = 
and we obtain a contradiction. Hence, lim^oo || ck* — a^H = as claimed. □ 
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3.2 Cluster points 



The main results of this section are the following. First, we prove that under the assump- 
tions EXU 12.3.21 and 12.3.31 any cluster point 9* is a global maximizer of F/3*(9*,9*). We 
then use this general result to prove that such cluster points are stationary points of the 
log-likelihood function. This result motivates a natural assumption under which 9* is in 
fact a local maximizer of l y . In addition we show that if the sequence (0 k )k^n converges to 
zero, i.e. 0* = 0, then 9* is a global maximizer of log-likelihood. Finally, we discuss some 
simple conditions under which the algorithm converges, i.e. has only one cluster point. 

The following theorem states a result which describes the stationary points of the 
proximal point algorithm as global maximizers of the asymptotic penalized function. 

Theorem 3.2.1. Assume that (3* > 0. Let 9* be any accumulation point of (9 k )keN- As- 
sume that (9*, 9*) E Dj. Then, 9* is a global maximizer of the penalized function Fp*(-, 9*) 
over the projection of Dj onto its first coordinate, i.e. 

Fp»{9*,9*) > F(9,9*) 

for all 9 such that (9,9*) ED^ 

An informal argument is as follows. Assume that G = MJ 1 . From the definition of the 
proximal iterations, we have 

for all subsequence (9 a ^)keN converging to 9* and for all 9 E 0. Now, assume we can 
prove that 9 a ^ also converges to 9*, we obtain by taking the limit and using continuity, 
that 

F^{9*,9*)>Fp,{9,9*) 

which is the required result. There are two major difficulties when one tries to transform 
this sketch into a rigorous argument. The first one is related to the fact that l y and I y 
are only defined on domains which may not to be closed. Secondly, proving that 0°w 
converges to 9* is not an easy task. This issue will be discussed in more detail in the next 
section. The following proof overcomes both difficulties. 

Proof. Without loss of generality, we may reduce the analysis to the case where 
0k > > f° r a certain 0. The fact that 9* is a cluster point implies that there is a 
subsequence of (9 k )kem converging to 9*. For k sufficiently large, we may assume that 
the terms (9 a<yk+l \ belong to a compact neighborhood C* of (9*, 9*) included in Dj. 
Recall that 

for all 9 such that (9,9 a ^- 1 ) E D 1 and a fortiori for (9,9 a ^- 1 ) E C* . Therefore, 

Fp{&* k \&M- x ) -{0 k -0*)I y {9< k \9< k )- 1 ) > 

F^(9,9^- 1 ) - (0 aikyi -0*)I y (9,9°W- x ). 
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Let us have a precise look at the "long term" behavior of I y . First, since (3k > (3* for 
all k sufficiently large, Lemma 12.4.31 says that 

lim I y (d a{h \d' j{k) - 1 ) = 0. 

k— >oo 

Thus, for any e > 0, there exits an integer K\ such that I y (9 a<yk \ 9 a ^ +1 ) < e for all k > K x . 
Moreover, Lemma 13.1.11 and continuity of t allows to conclude that 

lim t{9 a{k) - 1 ) = t(9*). 

k— >oo 

Since \I/ is continuous, for all e > and for all k sufficiency large we have 

I y (9*,9*) = tf(t(0*),i(0*)) 

>*(t(^(*)),t(^(*)- 1 ))-e (13) 
= I v \e ff W,6 ff W- 1 )-e. 

On the other hand, Fp* is continuous in both variables on C*, due to Assumptions 
12.3. l( i) and I2.3.27 i). By continuity in the first and second arguments of Fp*(-, •), for any 
e > there exists K 2 G N such that for all k > K2 

F^(9*,6)<F^(9^ k \9) + e. (14) 

Using ffl3l) . since l y is continuous, we obtain existence of K 3 such that for all k > K 3 

F/3* (9\ 9*) > Fp* (9 a{k \ 9 a{k)+1 ) - 2e. (15) 

Combining equations ( fT4j) and ( |T5i) with ( TT2l . we obtain 

F?*(9*,9*)> F^(9*,9)-(f3 k -f3*)I y (9^ k \9) 

+{p k - 0*)I y (9°W,9°M +1 )) - 3e. { } 

Now, since (3* = limfc^oo (3k, there exists an integer K4 such that (3k — (3* < e for all k > K^. 
Therefore for all k > max-fi^, K 2 , K 3 , K4}, we obtain 

Fp*(9*,9*) > Fp*(9*,9)-el y (9 a ( k \9)-e 2 -3e. 

Since I y is continuous and (9 a ^)km is bounded, there exists a real constant K such that 
I y {9 u{k \9) < K, for all n e N. Thus, for all k sufficiently large 

Fp*(9*,9*) > Fp*(9*,9) - (AeK + e 2 ). (17) 

Finally, recall that no assumption was made on 9, and that C* is any compact neighborhood 
of 9*. Thus, using the assumption 12.3. lT i). which asserts that l y (9) tends to —00 as \\9\\ 
tends to +00, we may deduce that (|T7|) holds for any 9 such that (9, 9*) 6 Dj and, letting 
e tend to zero, we see that 9* maximizes Fp*(9,9*) for over all 9 such that (9,9*) belongs 
to Di as claimed. □ 
Using this theorem, we may now deduce that certain accumulation points on the strict 
interior of the parameter's space are stationary points of the log-likelihood function. 
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Corollary 3.2.2. Assume that (3* > 0. Let 9* be any accumulation point of (9 k )k&- 
Assume that (6*, 8*) G intDj. Then, if l y is differentiable on Di, 9* is a stationary point 
of l y (9). Moreover, if l y is concave, then 6* is a global maximizer ofl y . 

Proof. Since under the required assumptions l y is differentiable and I y (9*, •) is differ- 
entiable at 9*, Theorem 13 . 2 . 1 1 states that 



Since I y (-, 9*) is minimum at 9*, V\I y (9*, 9*) = and we thus obtain that 9* is a stationary 
point of l y . This implies that 9* is a global maximizer in the case where l y is concave. □. 

Theorem 13.2.11 seems to be much stronger than the previous corollary. The fact that ac- 
cumulation points of the proximal sequence may not be global maximizers of the likelihood 
is now easily seen to be a consequence of fact that the Kullback distance-like function I y 
perturbs the shape of the likelihood function when 9 is far from 6*. This perturbation does 
not have serious consequence in the concave case. On the other hand, one may wonder 
whether 9* cannot be proved to be at least a local maximizer instead of a mere stationary 
point. The answer is given in the following corollary. 

Corollary 3.2.3. Let 9* be an accumulation point of (9 k )ken such that (9*, 9*) e intDj. 
In addition, assume that l y and I y (-, 9*) are twice differentiable in a neighborhood of 9* and 
that the Hessian matrix V 2 l y {9*) at 9* is not the null matrix. Then, if (3* is sufficiently 
small, 9* is a local maximizer of l y over Di . 

Proof. Assume that 9* is not a local maximizer. Since V 2 / y is not the null matrix, 
for (3* sufficiently small, there is a direction 5 in the tangent space to D\ for which the 
function f(t) = Fp*{9* + 18, 9*) has positive second derivative for t sufficiently small. This 
contradicts the fact that 9* is a global maximizer of Fp*(-, 9*) and the proof is completed. 



The next theorem establishes global optimality of accumulation points in the case where 
the relaxation sequence converges to zero. 

Theorem 3.2.4. Let 9* be any accumulation point of (9 h )keN- Assume that (9*, 9*) 6 Dj. 
Then, without assuming differentiability of either l y or of I y , if ((3k) ken converges to zero, 
9* is a global maximizer of l y over the projection of Dj along the first coordinate. 

Proof. Let (9 a ^) keN be a convergent subsequence of (9 k )k&n with limit denoted 9*. We 
may assume that for k sufficiently large, (9 a( - k+1 \ 9 a ^) belongs to a compact neighborhood 
C* of 9*. By continuity of l y , for any e > 0, there exists K 6 N such that for all k > K, 




□ 



i y (e*) > i y {e< k) ) 



— e. 



On the other hand, the proximal iteration (J3]) implies that 



ly(0 a{k) ) - ^ { k)-iI y (e^-\9^) > l y (9)-f3 a{k) ^I y (9^-\9), 
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€. 



for all 9 G D h Fix 9 G A- Thus, for all k > K, 

i y {9*) > i y (9) + p <k) _ x i v (erw-\ &M) - Paw-Me*™- 1 , 9) 

Since 1^ is a nonnegative function and (/3fc)fceN is a nonnegative sequence, we obtain 

i y {o*)>i y {o)-i3 a{k) - x i y {o°w-\e) 



e. 



Recall that (9 k ) keN is bounded due to Lemma 12.4.21 Thus, since I y is continuous, there 
exists a constant C such that I y (9 a ^ k ^ 1 , 9) < C for all k. Therefore, for k greater than K, 

l y (9*)>l y (9)-P aikyi C-e. 

Passing to the limit, and recalling that (/3fc)fceN tends to zero, we obtain that 

l y (9*) < l y (9) - e. 

Using the same argument as at the end of the proof of Theorem 13.2. 1[ this latter equation 
holds for any 9 such that (9, 9*) belongs to Dj, which concludes the proof upon letting e 
tend to zero. □ 



3.3 Convergence of the Kullback proximal sequence 

One question remains open in the analysis of the previous section: does the sequence 
generated by the Kullback proximal point converge? In other words: are there multiple 
cluster points? In Wu's paper |21j . the answer takes the following form. If the euclidean 
distance between two successive iterates tends to zero, a well known result states that 
the set of accumulation points is a continuum (see for instance [T5J Theorem 28.1]) and 
therefore, it is connected. Therefore, if the set of stationary points of l y is a countable set, 
the iterates must converge. 

Theorem 3.3.1. Let S* denote the set of accumulation points of the sequence (0 )kgs- 
Assume that lim^oo \\9 k+1 — 9 k \\ =0 and that l y (0) is strictly concave in an open neighbor- 
hood M of an accumulation point 9* of (9 )keN an d that (9*, 9*) is in intDj. Then, for any 
relaxation sequence (j3 k ) ke fq, the sequence (9 k ) ke fq converges to a local maximizer of l y {9). 

Proof. We obtained in Corollary 13.2.21 that every accumulation point 9* of (9 k ) kG ^ in 
intDi y and such that (9*, 9*) G intD^ is a stationary point of l y (0). Since l y (0) is strictly 
concave over A/", the set of stationary points of l y belonging to M reduces to singleton. 
Thus 9* is the unique stationary point in M of l y , and a fortiori, the unique accumulation 
point of (9 k ) k£ fq belonging to Af. To complete the proof, it remains to show that there is 
no accumulation point in the exterior of J\f. For that purpose, consider an open ball B of 
center 9* and radius e included in J\f. Then, x* is the unique accumulation point in B. 
Moreover, any accumulation point 9', lying in the exterior of Af must satisfy \\9* — 0'\\ > e, 
and we obtain a contradiction with the fact that S* is connected. Thus every accumulation 
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point lies in J\f, from which we conclude that 9* is the only accumulation point of (6 k )keN or, 
in other words, that (9 h )keN converges towards 9*. Finally, notice that the strict concavity 
of l y {9) over M implies that 9* is a local maximizer. □ 
Before concluding this section, let us make two general remarks. 

• Proving a priori that the set of stationary points of l y is discrete may be a hard task 
in specific examples. 

• In general, it is not known whether lim^oo ||# fc+1 — 9 k \\ =0 holds. In fact, Lemma 
13. 1.11 could be a first step in this direction. Indeed if we could prove in any application 
that the mapping t is injective, the desired result would follow immediately. However, 
injectivity of t does not hold in many of the standard examples; in the case of 
Gaussian mixtures, see [3l Section 2.2] for instance. Thus we are now able to clearly 
understand why the assumption that lirn^oo \\9 k+1 — 9 k \\ = is not easily deduced 
from general arguments. This problem has been overcome in [3] where it is shown that 
t is componentwise injective and thus performing a componentwise EM algorithm is 
a good alternative to the standard EM. 



4 Analysis of cluster points on the boundary 

The goal of this section is to extend the previous results to the case where some cluster 
points lie on the boundary of the region where computation of proximal steps is well de- 
fined. Such cluster points have rarely been analyzed in the statistical literature and the 
strategy developed for the interior case cannot be applied without further study of the 
Kullback distance-like function. Notice further that entropic-type penalization terms in 
proximal algorithms have been the subject of an intensive research effort in the mathe- 
matical programming community with the goal of handling positivity constraints; see [19] 
and the references therein for instance. The analysis proposed here applies to the more 
general Kullback distance-like functions I y that occur in EM. Our goal is to show that 
such cluster points satisfy the well known Karush-Kuhn- Tucker conditions of nonlinear 
programming which extend the stationarity condition Vl y (9) = to the case where 9 is 
subject to constraints. As before, it is straightforward to extend the proposed analysis to 
the case of penalized likelihood estimation. 

In the sequel, the distance-like function will be assumed to have the following additional 
properties. 



Assumptions 4.0.2. The Kullback distance-like function I y is of the form 

I y (9,9) = ^(viM e )K¥§i)' 



where for all i and j , Uj is continuously differentiate on its domain of definition, aiij is 
a function from y to W + , the set of positive real numbers, and the function <p is a non 
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negative convex continuously differentiable function defined for positive real numbers only 
and such that 0(r) = if and only if r = 1. 

If tij (9) = 9i and = 1 for all % and all j, the function I y is the well known divergence 
defined by Csiszar in [7] . Assumption 14.0.21 is satisfied in most standard examples (for 
instance Gaussian mixtures and Poisson inverse problems) with the choice <f>(r) = r log(r). 



4.1 More properties of the Kullback distance-like function 

The main property that will be needed in the sequel is that under Assumption 14.0.21 the 
function I y satisfies the same property as the one given in Lemma [3.1.11 above . even on the 
boundary of its domain Dj. This is the result of Proposition 14.1.21 below. We begin with 
one elementary lemma. 



Lemma 4.1.1. Under Assumptions \4-0~<\ the function <fi is decreasing on (0,1), is in- 
creasing on (l,+oo) and 0(r) converges to +oo when r converges to +oo. We have 
lim fc ^ +00 4>(r k ) = if and only if\im k ^ +OQ T k = 1. 

Proof. The first statement is obvious. For the second statement, the "if part is 
trivial, so we only prove the "only if part. First notice that the sequence (r k ) keN must be 
bounded. Indeed, the level set {r | <p{r) < 7} is bounded for all 7 > and contains the 
sequence {r k ) k >K for K sufficiently large. Thus, the Bolzano- Weierstass theorem applies. 
Let t* be an accumulation point of (r fc ) fcgN . Since (f> is continuous, we get that 4>{r*) = 
and thus we obtain r* = 1. From this, we deduce that the sequence has only one cluster 
point, which is equal to 1. Therefore, \im k ^ +00 r k = 1. □ 

Using these lemmas, we are now in position to state and prove the main property of I y . 

Proposition 4.1.2. The following statements hold. 

(i) For any sequence (8 h )keN in and any bounded sequence (rj k )keN in M. + , the fact 
that limfc-j.+oo I y (fj k , 6 k ) = implies limfc_> +00 \tij(r] k ) — tij{6 k )\ = for all i,j such that 

^ 0. 

(ii) If one coordinate of one of the two sequences {Q k )km and {rj k )km tends to infinity, 
so does the other's same coordinate. 

Proof. Fix i in {1, . . . , n) and j in {1, . . . , m} and assume that ^ 0. 
(i) We first assume that (tij(r)f))ke$s is bounded away from zero. 

Since \im k ^ +00 I y (9 k ,r] k ) = 0, then \im k ^ +00 (j)(tij(8 k )/tij(r] k )) = and Lemma 14.1.11 
implies that lim fc ^ +00 t ij (9 k )/t ij (rj k ) = 1. Thus, \im k ^ +00 (t ij (6 k ) - t ij (r] k ))/ti j (r] k ) = and 
since t is continuous, tij{j] k ) is bounded. This implies that lim fc _ i . +00 \tij(9 k ) — tij(r] k ) \ = 0. 

Next, consider the case of a subsequence (iij(^ <T ^))feeN which tends towards zero. 
For contradiction, assume the existence of a subsequence (tij(9 a ^' l( - k ^) ke fq which remains 
bounded away from zero, i.e. there exists a > such that tij(9 a( -"'^) k& fq > a for k suffi- 
ciently large. Thus, for k sufficiently large we get 

U 3 {0^ m ) > a 1 
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and due to the fact that is increasing on (1, +00), we obtain 



a 



On the other hand, Lemma [4.1.11 says that for any b > 1, <f)'(b) > 0. Since <fi is convex, we 
get 

0(r)>0(6) + 0'(6)(r-6). 
Take r = a/tij(rj k ) in this last expression and combine with ( ITS"]) to obtain 

^MiSS?) - + Mi-^ - 0- 

Passing to the limit, we obtain 

= lim t tf (7ffr(fc))W MZ ) > o 0/( 6 ) > 0, 

which gives the required contradiction. 

(ii) If (tij(9 k )) keN +00 then {tij(r] k )) keN — > +00 is a direct consequence of part (i). 
Indeed, if tij(r] k ) remains bounded, part (i) says that lim fc _j. +00 \tij(rj k ) —tij{6 k ) \ = 0, which 
contradicts divergence of (tij(9 k ))keN- 

Now, consider the case where (tij(rj k ))km +00. Then, a contradiction is easily 
obtained if we assume that at least a subsequence (tij(0 a ^)kem stays bounded from above. 
Indeed, in such a case, we have 

lim —. rrrf = 0, 

fc^+oo t ij (r] a ( k >) 

and thus, 4>(tij(9 k )/tij(i] k )) > 7 for some 7 > since we know that is decreasing on (0, 1) 
and 0(1) = 0. This implies that 

which is the required contradiction. □ 



4.2 Cluster points are KKT points 

The main result of this section is the property that any cluster point 9* such that (9*, 9*) 
lies on the boundary of Dj satisfies the Karush-Kuhn- Tucker necessary conditions for 
optimality on the domain of the log-likelihood function. In the context of Assumptions 
13X21 Dj is the set 

Dj = {9 E R n I tij(9) > Vi E {1, . . . ,n} and j E {1, . . . ,m}}. 

We have the following theorem. 
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Theorem 4.2.1. Let 9* be a cluster point of the Kullback-proximal sequence. Assume that 
all the functions t^ are differentiable at 9* . Let X* be the set of all couples of indices 
such that the constraint Uj(6) > is active at 9*, i.e. Uj(9*) = 0. If 9* lies in the interior 
of D\, then 9* satisfies the Karush-Kuhn-Tucker necessary conditions for optimality, i.e. 
there exists a family of reals Ay, G X* such that 

Vl y (8*)+ AyV%(0*) = O. 



Proof. Let <&ij(6,6) denote the bivariate function defined by 



U 3 {9) 

Let {9°( k '}keN be a convergent subsequence of the proximal sequence with limit equal to 
9*. The first order optimality condition at iteration k is given by 



+Eii^(%)*y(^ ( * :) )Vi$(^ (&) ,r (fe) - 1 )) = o. 

We have 



(19) 



for all i and j. 

Claim A. For all such that aij(yj) ^ 0, we have 

lim ty^^Vi^^,^- 1 ) = 0. 

k— y +oo 



cr(fc) v 



Proof of Claim A. Two cases may occur. In the first case, we have ty(9*) = 0. 
Since the sequence {9 k } keN is bounded due to Lemma f2.4.2} continuous differentiability 
of 4> and the tjj proves that V \^{9 (T ^ k \9 a ^~ l ) is bounded from above. Thus, the de- 
sired conclusion follows. In the second case, Uj{9*) ^ and applying Lemma T2.4.31 we 
deduce that I y {9 a{ - k \ 9 a ^~ 1 ) tends to zero. Hence, lim fc ^ +00 $(0"( fc ), 9 u{k) ~ 1 ) = 0, which 
implies that lim^+oo 9 a( - k ^ / 9 <T ^ k ^ 1 = 1. From this and Assumptions l4~072l we deduce 
that \im k ^ +00 (p'(t ij (9 a ^- 1 )/t ij (9 a ^)) = 0. Since {9 a ^} keN converges to 9* and that 
Uj{9*) 7^ 0, we obtain that the subsequence {Uj(9 a ^~ 1 ) /Uj(9 a ^)} ke ^ is bounded from 
above. Moreover, {Vtij(9 a ^)} k& ^ is also bounded by continuous differentiability of ty. 
Therefore, the fact that lim fc ^ +00 ^(t ij (9 a ^- 1 )/t ij (9 a ^)) = establishes Claim A. □ 

Using this claim, we just have to study the remaining right hand side terms in (fT9l) . 

namely the expression 

Eij a v (yj ) vt u ( 9a{k) ) ( u l^Zl) ) 1} ) • Let X ** be a subset of the active 
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indices X such that the family {V£y(#*)}y is linearly independent. This linear indepen- 
dence is preserved under small perturbations, we may assume without loss of generality 

that the family < VtiJ8 a ^) \ is linearly independent for k sufficiently large. For 

I J (ij)ei" 

such k, we may rewrite equation ( TT9l) as 

+ E< i «ii(2/i)%(^ ( * ) )Vi^(0 ff(fc) ,0 <r(fc) - 1 )) = o. 
Claim B. The sequence {\ij(yj)}k&t is bounded. 

Proof of claim B. Using the previous claim and the continuous differentiability of l y 
and tij, equation (120!) expresses that {Kj k \vj)}ij are proportional to the coordinates of 
the projection on the span of the {VUj(0 a ^)}ij of a vector converging towards Vl y {6*). 
Since {Vtij(9 a ^)}ij, for G I**, form a linearly independent family for k sufficiently 
large, none of the coordinates can tend towards infinity. □ 

We are now in position to finish the proof of the theorem. Take any cluster point 
of tij(6 a ^~ l ) /Uj(9 a ^) . Using Claim B, we know that (A^- (yj))(i,j)ex** lies in a compact 
set. Let (A*j)(ij) e x** be a cluster point of this sequence. Passing to the limit, we obtain 
from equation (|T9j) that 

vz,(^ (fc) ) + /r( Kj^uAe*)) = o. 

for every cluster point (3* of {f3 a (k)}keN- For all G X**, set A^- = /3*A*,-. This equation 
is exactly the Karuch-Kuhn- Tucker necessary condition for optimality. □ 

Remark 4.2.2. If the family (V^(r w )) (!j)eI . is linearly independent for k sufficiently 
large, Theorem \4-2.1\ holds and in addition the are nonnegative, which proves that 

6* satisfies the Karush-Kuhn- Tucker conditions when it lies in the closure of T>i . 

5 Application 

The goal of this section is to illustrate the utility of the previous theory for a nonparametric 
survival analysis with competing risks proposed by Ahn, Kodell and Moon in pQ. 

5.1 The problem and the Kullback proximal method 

This problem can be described as follows. Consider a group of iV animals in an animal 
carcinogenecity experiment. Sacrifices are performed at certain prescribed times denoted 
by ii, t2, ■ ■ ■ , t m in order to study the presence of the tumor of interest. Let T\ be the time 
to onset of tumor, To the time to death from this tumor and Xq be the time to death 
from a cause other than this tumor. Notice that Xi, To and Xc are unobservable. The 
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quantities to be estimated are S(t), P(t) and Q(t), the survival function of Ti, To and Xc 
respectively. It is assumed that Ti and T D are statistically independent of Xc- 

A nonparametric approach to estimation of S, P and Q is proposed in PQ: observed 
data yx, . . . ,y n are the number of deaths on every interval {tj, tj + i] which can be classified 
into the following four categories, 

• death with tumor (without knowing cause of death) 

• death without tumor 

• sacrifice with tumor 

• sacrifice without tumor 

This gives rise to a multinomial model whose probability mass is parametrized by the values 
of S, P and Q at times ti, . . . , t m . More precisely, for each time interval (tj, denote by 
Cj the number of deaths with tumor present, by the number of deaths with tumor absent, 
a 2 j the number of sacrifices with tumor present and & 2 j the number of sacrifices with tumor 
absent. Let Nj < N be the number of live animals in the population at tj, it is shown in 
PQ that the corresponding log-likelihood is given by 

\ogg{y, 0) = J2T=i( N i-i ~ N j) Ei=i !og(Pfc<?fc) + («2j + b 2j ) hg(pjqj) 

+ Cj log ((1 - Pj ) + (1 - 7T jPj ){l - fc)) (21) 

+btj log((l - qj)Hj-i) + a 2j log(l - iTj) + b 2j log 7Tj + Cst, 

where Cst is a constant 7Tj = S{tj)/P{tj), pj = P(tj)/P(tj_i) and qj = Q(tj)/Q(tj_i), 
j — 1, . . . ,m, 6 — (ttx, . . . ,pj,Pi, ■ ■ ■ ,pj, qi, . . . , qj) and the parameter space is specified by 
the constraints 

0= jfl = (tti, . . . ,pj,pi, . . . ,pj,qi, • • • ,qj) \ < iij < 1, 

, 

where the last nonconvex constraint serves to impose monotonicity of S. Note that mono- 
tonicity of P and Q is a direct consequence of the constraints on the p/s and the g/s, 
respectively. 

Define the complete data measurement that indicates the cause of 

death in addition to the presence of absence of a tumor in the dead animals. Specifi- 
cally, Xi, . . . ,x n should fall into one of the following categories 

• death caused by tumor and death with incidental tumor 

• death without tumor 

• sacrifice with tumor 

• sacrifice without tumor 
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To each time interval (tj, £7+1] among those animals dying of natural causes, there cor- 
respond the numbers dj of deaths caused by tumor and the number a\j of deaths with 
incidental tumor, neither of which are observable. The associated complete log-likelihood 
function is given by 



log f(x; 9) = Y™=i {Nj-x - Nj) 5X=i l °g(Pk1k) + (a 2j + b 2j ) log(p^) 

+dj log(l - pj) + aij log ((1 - Zjpj)(l - qj)J (23) 
+b lj log((l - 5 i )7r i _i) + a 2j log(l - Kj) + b 2j log it j + Cst 

Now, we have to compute the expectation Q(9, 9) of the log-likelihood function of the 
complete data conditionally to the parameter 9. The random variables dj and a\j are 
binomial with parameter Xj and 1 — Aj where Xj is the probability that the death was 
caused by the tumor conditioned on the presence of the tumor. Conditioned on 9, we have 

3 1 i>j • (1 • ~,/>,!U <l,) 1 

(see [TJ Section 3] for details). From this, we obtain that the conditional mean values of dj 
and are given by 



Therefore 



E[dj j y; 9) — XjCj and Efay | y; 9) — (1 — Xj)cj. (25) 

<W = E?=M-i ~ N j) EC! log( Pk q k ) + {a 2j + b 2j ) log(p jqj ) 

+XjCj log(l - pj) + (1 - Xj)cj log ((1 - n jPj )(l - (26) 
log((l - qj)nj-i) + a 2j log(l - ttj) + 6 2 j log^- + Cst. 

From this, we can easily compute the associated Kullback distance-like function: 



m 



w) = ^mKw)) +t '^W)))' (27) 



with 



t'(9) = f 1 -* and f'(9) = (1 ~ ~ ^ , (28) 

^ ' 1 -pj + (1 - 7r iPi )(l - ?i ) jW 1 - Pj + (1 - 7r iPi )(l - qj) K } 

and is defined by <f>(r) = rlog(r). It is straightforward to verify that Assumptions 12.37X1 
I2T3T21 12X31 and jjjj are satisfied. 

The main computational problem in this example is to handle the difficult nonconvex 
constraints entering the definition of the parameter space 0. The authors of [15] and pQ 
use the Complex Method proposed by Box in |2J to address this problem. However, the 
theoretical convergence properties of Box's method are not known as reported in article 
MRO 184734 in the Math. Reviews. Using our proximal point framework, we are able to 
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easily incorporate the nonconvex constraints into the Kullback distance-like function and 
obtain an efficient algorithm with satisfactory convergence properties. For this purpose, 
let /' be defined by 



where 




Using this new function, the nonconvex constraints TTjPj < 7Tj_i are satisfied for all proximal 
iterations and Assumptions 14.0.21 still hold. 

5.2 Experimental results 

We implemented the Kullback proximal algorithm with different choices of relaxation se- 
quence (/3fc)fcgN, 0k — P- The M-step of the EM algorithm does not have a closed form 
solution, so that nothing is lost by setting (3k to a constant not equal to one. 

We attempted to supplement the KPP-EM algorithm with the Newton method and 
other built-in methods available in Scilab but they were not even able to find local max- 
imizers due to the explosive nature of the logarithms near zero, leading these routines 
to repetitive crashes. To overcome this difficulty, we found it convenient to use the ex- 
tremely simple simulated annealing random search procedure; see [22] for instance. This 
random search approach avoids numerical difficulties encountered using standard opti- 
mization packages and easily handles nonconvex constraints. The a.s. convergence of this 
procedure is well established and recent studies such as [11] confirm the good computational 
efficiency for convex functions optimization. 

Some of our results for the data of Table 1 of [15] are given in Figures 1 to 4. In 
the reported experiments, we chose three constant sequences with respective values (3 n = 
100, 1, .01. We observed the following phenomena 

1. after one hundred iterations the increase in the likelihood function is less than 10~ 5 
except for the case /3 n = 100 (Figure H]) where the algorithm had not converged. 

2. for (3 n = 100 we often obtained the best initial growth of the likelihood 

3. for (3 n = .01 we always obtained the highest likelihood when the number of iterations 
was limited to 50 (see Figure [3] for the case MCL Male AL). 

It was shown in [5] that penalizing with a parameter sequence (/3 n )neN converging 
towards zero implies superlinear convergence in the case where the maximum likelihood 
estimator lies in the interior of the constraint set. Thus, our simulations results seem to 
confirm observation 3. The second observation was surprising to us but this phenomenon 
occured repeatedly in our experiments. This behavior did not occur in our simulations for 
the Poisson inverse problem in [5] for instance. 

In conclusion, this competing risks estimation problem is an interesting test for our 
Kullback-proximal method which shows that the proposed framework can provide prov- 
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ably convergent methods for difficult constrained nonconvex estimation problems for which 
standard optimization algorithms can be hard to tune. The relaxation parameter sequence 
(/3«)neN also appeared crucial for this problem although the choice (3 n = 1 could not really 
be considered unsatisfactory in practice. 



10 15 20 25 30 35 40 45 50 55 



Figure 1: Evolution of the log-likelihood versus iteration number: MCL Female CR case 



6 Conclusions 

The goal of this paper was the study of the asymptotic behavior of the EM algorithm 
and its proximal generalizations. We clarified the analysis by making use of the Kullback- 
proximal theoretical framework. Two of our main contributions are the following. Firstly 
we showed that interior cluster points are stationary points of the likelihood function and 
are local maximizers for sufficiently small values of 0. Secondly, we showed that cluster 
points lying on the boundary satisfy the Karush-Kuhn- Tucker conditions. Such cases were 
very seldom studied in the literature although constrained estimation is a topic of growing 
importance; see for instance the special issue of the Journal of Statistical Planning and 
Inference [TU] which is devoted to the problem of estimation under constraints. On the 
negative side, the analysis from the Kullback-proximal viewpoint allowed us to understand 
why uniqueness of the cluster point is hard to establish theoretically. On the positive side, 
we were able to implement a new and efficient proximal point method for estimation in 
the difficult tumor lethality problem involving nonlinear inequality constraints. 
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Figure 2: Evolution of the log-likelihood versus iteration number: MCL Male AL case 




Figure 3: Evolution of the log-likelihood versus iteration number: Detail of MCL Male AL 

case 
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Figure 4: Evolution of the log-likelihood versus iteration number: MCL Female AL case 
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